Analysing grouping of nucleotides in DNA sequences using lumped processes constructed from Markov chains.
Authors
Abstract
The most commonly used models for analysing local dependencies in DNA sequences are (high-order) Markov chains. Incorporating knowledge about possible groupings of the nucleotides makes it possible to define dedicated sub-classes of Markov chains. The problem of formulating lumpability hypotheses for a Markov chain is therefore addressed. In the classical approach to lumpability, this problem can be formulated as the determination of an appropriate state space (smaller than the original state space) such that the lumped chain defined on this state space retains the Markov property. We propose a different perspective on lumpability where the state space is fixed and the partitioning of this state space is represented by a one-to-many probabilistic function within a two-level stochastic process. Three nested classes of lumped processes can be defined in this way as sub-classes of first-order Markov chains. These lumped processes enable parsimonious reparameterizations of Markov chains that help to reveal relevant partitions of the state space. Characterizations of the lumped processes on the original transition probability matrix are derived. Different model selection methods, relying either on hypothesis testing or on penalized log-likelihood criteria, are presented, as well as extensions to lumped processes constructed from high-order Markov chains. The relevance of the proposed approach to lumpability is illustrated by the analysis of DNA sequences. In particular, the use of lumped processes makes it possible to highlight differences between intronic sequences and gene untranslated region sequences.
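The classical notion of lumpability mentioned in the abstract can be illustrated with a minimal sketch. It checks the standard strong-lumpability condition (for every pair of blocks A and B, the probability of moving into B is the same from every state of A) on a first-order transition matrix over the four nucleotides. This is not the two-level lumped process proposed by the authors. The function names, the matrix `P`, and the purine/pyrimidine partition are illustrative assumptions, not taken from the paper.

```python
from itertools import product


def is_strongly_lumpable(matrix, partition, tol=1e-9):
    """Check the classical strong-lumpability condition.

    matrix    : dict of dicts, matrix[i][j] = P(next = j | current = i)
    partition : list of blocks, each block a string of state symbols
    """
    for block_a, block_b in product(partition, repeat=2):
        # Probability of entering block_b from each state of block_a.
        into_b = [sum(matrix[i][j] for j in block_b) for i in block_a]
        if max(into_b) - min(into_b) > tol:
            return False
    return True


def lumped_matrix(matrix, partition):
    """Block-to-block transition matrix (meaningful only if lumpable)."""
    # Under lumpability any state of a block gives the same row,
    # so the first state of each block is used as its representative.
    return [
        [sum(matrix[block_a[0]][j] for j in block_b) for block_b in partition]
        for block_a in partition
    ]


# Hypothetical transition matrix, built by hand so that it is lumpable
# with respect to purines (A, G) versus pyrimidines (C, T).
P = {
    "A": {"A": 0.10, "C": 0.20, "G": 0.50, "T": 0.20},
    "G": {"A": 0.40, "C": 0.10, "G": 0.20, "T": 0.30},
    "C": {"A": 0.20, "C": 0.30, "G": 0.10, "T": 0.40},
    "T": {"A": 0.15, "C": 0.50, "G": 0.15, "T": 0.20},
}

purine_pyrimidine = ["AG", "CT"]
weak_strong = ["AT", "CG"]

print(is_strongly_lumpable(P, purine_pyrimidine))
print(is_strongly_lumpable(P, weak_strong))
print(lumped_matrix(P, purine_pyrimidine))
```

With this hand-built matrix the purine/pyrimidine partition satisfies the condition, while the A,T versus C,G partition does not. In practice the matrix would be estimated from sequence data, and an exact equality check would be replaced by a statistical test or a penalized-likelihood comparison, which are the kinds of model selection methods the abstract refers to.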
Similar resources
Evaluation of First and Second Markov Chains Sensitivity and Specificity as Statistical Approach for Prediction of Sequences of Genes in Virus Double Strand DNA Genomes
Growing amount of information on biological sequences has made application of statistical approaches necessary for modeling and estimation of their functions. In this paper, sensitivity and specificity of the first and second Markov chains for prediction of genes was evaluated using the complete double stranded DNA virus. There were two approaches for prediction of each Markov Model parameter,...
Probabilistic Sufficiency and Algorithmic Sufficiency from the point of view of Information Theory
Given the importance of Markov chains in information theory, the definition of conditional probability for these random processes can also be defined in terms of mutual information. In this paper, the relationship between the concept of sufficiency and Markov chains from the perspective of information theory and the relationship between probabilistic sufficiency and algorithmic sufficien...
Distribution of First Passage Times for Lumped States in Markov Chains
First passage time in Markov chains is defined as the first time that a chain passes a specified state or lumped states. This state or lumped states may indicate first passage time of an interesting, rare and amazing event. In this study, obtaining distribution of the first passage time relating to lumped states which are constructed by gathering the states through lumping method for an irreducible...
Computational Biology Lecture 9: CpG islands, Markov Chains, Hidden Markov Models HMMs
Given a DNA or an amino acid sequence, biologists would like to know what the sequence represents. For instance, is a particular DNA sequence a gene or not? Another example would be to identify which family of proteins a given protein (amino acid sequence) belongs to. In both cases above, we have a sequence of symbols from some alphabet and we are required to say something about the structure o...
Computational Investigation on Structural Properties of Carbon Nanotube Binding to Nucleotides According to the QM Methods
The interaction between nucleotides and carbon nanotubes (CNTs) is a subject of many investigations for treating diseases but there are many questions in this field that remain unanswered. Because experimental methods involve assumptions and interpretation besides limitations, there are many problems that the best study for them is using theoretical study. Consequently, t...
Journal title:
- Journal of Mathematical Biology
Volume 52, Issue 3
Pages -
Publication date 2006